Summarizing statistical data for database systems and/or environments

ABSTRACT

Database values and their associated indicators can be arranged in multiple “buckets.” Adjacent buckets can be combined into a single bucket successively based on one or more criteria associated with the indicators to effectively reduce the number of buckets until a desired number is reached.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of and takes priority from U.S. patent application Ser. No. 14/087,278, entitled: “SUMMARIZING STATISTICAL DATA FOR DATABASE SYSTEMS AND/OR ENVIRONMENTS,” by Luo et al., filed on Nov. 22, 2013, which is hereby incorporated by reference herein in its entirety and for all purposes.

BACKGROUND

Data can be an abstract term. In the context of computing environments and systems, data can generally encompass all forms of information storable in a computer readable medium (e.g., memory, hard disk). Data, and in particular, one or more instances of data can also be referred to as data object(s). As is generally known in the art, a data object can, for example, be an actual instance of data, a class, a type, or a particular form of data, and so on.

Generally, one important aspect of computing and computing systems is storage of data. Today, there is an ever increasing need to manage storage of data in computing environments. Databases provide a very good example of a computing environment or system where the storage of data can be crucial. As such, to provide an example, databases are discussed below in greater detail.

The term database can also refer to a collection of data and/or data structures typically stored in a digital form. Data can be stored in a database for various reasons and to serve various entities or “users.” Generally, data stored in the database can be used by one or more of the “database users.” A user of a database can, for example, be a person, a database administrator, a computer application designed to interact with a database, etc. A very simple database or database system can, for example, be provided on a Personal Computer (PC) by storing data (e.g., contact information) on a Hard Disk and executing a computer program that allows access to the data. The executable computer program can be referred to as a database program, or a database management program. The executable computer program can, for example, retrieve and display data (e.g., a list of names with their phone numbers) based on a request submitted by a person (e.g., show me the phone numbers of all my friends in Ohio).

Generally, database systems are much more complex than the example noted above. In addition, databases have evolved over the years and are used in various businesses and organizations (e.g., banks, retail stores, governmental agencies, universities). Today, databases can be very complex. Some databases can support several users simultaneously and allow them to make very complex queries (e.g., give me the names of all customers under the age of thirty five (35) in Ohio that have bought all the items in a given list of items in the past month and also have bought a ticket for a baseball game and purchased a baseball hat in the past 10 years).

Typically, a Database Manager (DBM) or a Database Management System (DBMS) is provided for relatively large and/or complex databases. As known in the art, a DBMS can effectively manage the database or data stored in a database, and serve as an interface for the users of the database. For example, a DBMS can be provided as an executable computer program (or software) product as is also known in the art.

It should also be noted that a database can be organized in accordance with a Data Model. Some notable Data Models include a Relational Model, an Entity-relationship model, and an Object Model. The design and maintenance of a complex database can require highly specialized knowledge and skills by database application programmers, DBMS developers/programmers, database administrators (DBAs), etc. To assist in design and maintenance of a complex database, various tools can be provided, either as part of the DBMS or as freestanding (stand-alone) software products. These tools can include specialized Database languages (e.g., Data Description Languages, Data Manipulation Languages, Query Languages). Database languages can be specific to one data model or to one DBMS type. One widely supported language is Structured Query Language (SQL), developed, by and large, for the Relational Model, which can combine the roles of a Data Description Language, a Data Manipulation Language, and a Query Language.

Today, databases have become prevalent in virtually all aspects of business and personal life. Moreover, usage of various forms of databases is likely to continue to grow even more rapidly and widely across all aspects of commerce, social and personal activities. Generally, databases and the DBMSs that manage them can be very large and extremely complex, partly in order to support an ever increasing need to store data and analyze data. Typically, larger databases are used by larger organizations, larger user communities, or device populations. Larger databases can be supported by relatively larger capacities, including computing capacity (e.g., processor and memory) to allow them to perform many tasks and/or complex tasks effectively at the same time (or in parallel). On the other hand, smaller database systems are also available today and can be used by smaller organizations. In contrast to larger databases, smaller databases can operate with less capacity.

A current popular type of database is the relational database with a Relational Database Management System (RDBMS), which can include relational tables (also referred to as relations) made up of rows and columns (also referred to as tuples and attributes). In a relational database, each row represents an occurrence of an entity defined by a table, with an entity, for example, being a person, place, thing, or another object about which the table includes information.

One important objective of databases, and in particular a DBMS, is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an “optimal” query plan can be selected as the best option by a database optimizer (or optimizer). Ideally, an optimal query plan is a plan with the lowest cost (e.g., lowest response time, lowest CPU and/or I/O processing cost, lowest network processing cost). The response time can be the amount of time it takes to complete the execution of a database operation, including a database request (e.g., a database query) in a given system. In this context, a “workload” can be a set of requests, which may include queries or utilities, such as load, that have some common characteristics, such as, for example, application, source of request, type of query, priority, response time goals, etc.

Generally, data (or “Statistics”) can be collected and maintained for a database. “Statistics” can be useful for various purposes and for various operational aspects of a database. In particular, “Statistics” regarding a database can be very useful in optimization of the queries of the database, as generally known in the art.

In view of the prevalence of databases in various aspects of life today and the importance of collecting Statistics in operating various databases, it is apparent that techniques relating to database Statistics are very useful.

SUMMARY

Broadly speaking, the invention relates to computing environments and systems. More particularly, the invention relates to summarizing information for databases.

In accordance with one aspect of the invention, database values (e.g., column values of a database table) and one or more indicator values associated with them (e.g., frequencies of occurrences of column values in a database table) can be arranged, for example, in multiple “buckets.” Then, the adjacent buckets in the arrangement can be combined into a single bucket successively based on one or more criteria to effectively reduce the total number of buckets until a desired total number of buckets is reached. The one or more criteria that are used for combining the buckets can be associated with the indicator values in order to provide a summary (e.g., histogram) of the database values and their indicator values that can generally relay information about the database values. The one or more criteria can, for example, be associated with the indicator values (e.g., proximity of frequency of occurrences) to provide a summary that effectively combines similar information together and attempts to minimize the error in order to provide an accurate summary. By way of example, adjacent buckets can be combined based on a constraint associated with the differences between their indicator values (e.g., an error value measured based on the differences between the frequencies of occurrences for adjacent buckets).

Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:

FIG. 1 depicts a statistical summarizer in a computing environment in accordance with one embodiment of the invention.

FIGS. 2A, 2B and 2C depict a simplified distribution of distinct values representative of statistical data that can be summarized by a statistical summarizer for a database in accordance with one embodiment of the invention.

FIG. 3A depicts a resulting summary of statistical data that can be obtained in accordance with one embodiment.

FIGS. 3B and 3C depict resulting summaries of statistical data that can be obtained using conventional techniques.

FIG. 4 depicts a method 400 of storing database values and their associated indicator values in a summarized form in accordance with one embodiment of the invention.

FIG. 5 depicts a method 500 of storing frequencies of column values in a summarized form for a table of a database in accordance with one embodiment of the invention.

FIG. 6 depicts a database node of a database system or a Database Management System (DBMS) in accordance with one embodiment of the invention.

FIGS. 7 and 8 depict a parsing engine in accordance with one embodiment of the invention.

DETAILED DESCRIPTION

As noted in the background section, techniques relating to database Statistics are very useful.

To further elaborate, database systems can use histograms to group table column values into “buckets” according to their frequency distribution as a summary. This summary can then be used to estimate the selectivity of queries in the query optimization phase. Equal-width and Equal-depth histograms are known in the art.

Also, a variation of the High Biased Histogram (HBH) has been used. In concept, it removes the most frequently occurring values (i.e., high-biased values) by keeping them in dedicated buckets. A fixed number of buckets for the histogram, for example, 250, can be assumed. This number may be adjusted over time. Each bucket can, for example, either represent a range of values and their average frequency or represent two high-biased values and their actual frequencies. For example, if 100 out of 250 buckets are used for high-biased values, 200 high-biased values can be saved (two per bucket). The rest of the values are then non-high-biased and can be represented, using an Equal-depth histogram, in the remaining 150 buckets.

Generally, HBH can perform well. However, problems can be encountered if the data has more values with high frequencies than the high-biased value buckets can hold and there is also variance among the frequencies of the non-high-biased values. At least from this perspective, it would be very useful to have another kind of histogram that can effectively serve as a complement to HBH.

More recently, V-Optimal histograms (VOH) have been developed as arguably the state-of-the-art approach for generating histograms for databases. Generally, V-Optimal histograms search for the best bucket boundaries for grouping values to minimize the accumulated variance between the actual frequency and the estimated frequency of each value. However, in practice, there are two problems for VOH. One is the computation cost, given that finding the globally optimal bucket boundaries is an NP-hard problem. Another problem is that the traditional V-Optimal histograms are typically constructed from data only, so the construction process may not account for the characteristics of the application workload or data access patterns.

Accordingly, there is a need for alternative techniques for summarizing data or statistics of databases.

Hence, it will be appreciated that the described techniques, among other things, can be used to provide a Constrained V-Optimal Histogram (CVOH) as an alternative technique for summarizing data or statistics of databases.

Generally, the CVOH can cost less to implement and can also be tailored based on various criteria, including, for example, the characteristics of the application workload and its data access pattern. Generally, database values (e.g., column values of a database table) and one or more indicator values associated with them (e.g., frequencies of occurrences of column values in a database table) can be arranged, for example, in multiple “buckets.” Then, the adjacent buckets in the arrangement can be combined into a single bucket successively based on one or more criteria to effectively reduce the total number of buckets until a desired total number of buckets is reached in accordance with one aspect of the invention. The one or more criteria that are used for combining the buckets can be associated with the indicator values in order to provide a summary (e.g., histogram) of the database values and their indicator values that can generally relay information about the database values. The one or more criteria can, for example, be associated with the indicator values (e.g., proximity of frequency of occurrences) to provide a summary that effectively combines similar information together and attempts to minimize the error in order to provide an accurate summary. By way of example, adjacent buckets can be combined based on a constraint associated with the differences between their indicator values (e.g., an error value measured based on the differences between the frequencies of occurrences for adjacent buckets).

Embodiments of these aspects of the invention are also discussed below with reference to FIGS. 1-8. However, those skilled in the art will readily appreciate that the detailed description given herein with respect to these figures is for explanatory purposes as the invention extends beyond these limited embodiments.

FIG. 1 depicts a statistical summarizer 102 in a computing environment 104 in accordance with one embodiment of the invention. The statistical summarizer 102 can provide a summary 100 for statistical data in accordance with one or more criteria that can, for example, be provided as input parameters 108 to the statistical summarizer 102. As suggested by FIG. 1, the statistical summarizer 102 can, for example, be provided as a part of a Database Management System (DBMS) 104 for a database 106. As such, the summary 100 can, for example, represent a summary of statistics for a database (e.g., a histogram for column values of a table in a database). In the context of a summary of statistics for a database, the input parameters 108 can, for example, specify one or more parameters and/or conditions for the summary (e.g., summarize in four different categories, ranges, etc.).

By way of example, a number of values (X₁-X_(N)) in the database can each be respectively associated with one or more indicators (F₁-F_(N)) in the database 106. The values (X₁-X_(N)) can, for example, be column values and indicators (F₁-F_(N)) can, for example, represent frequencies of occurrences respectively for the column values (X₁-X_(N)). In this example, an input parameter 108 can, for example, indicate to the statistical summarizer 102 that a summary is to be provided with only three (3) ranges of values for tens or hundreds of column values (X₁-X_(N)) in the database 106.

Referring to FIG. 1, data 110 of the database can be presented as data 110 _(!) with a number of “buckets” B₁-B_(N), where each bucket B_(i) consists of a value X_(i) and its associated indicator value(s) F_(i). In order to provide the summary 100, the statistical summarizer 102 may optionally combine any adjacent buckets that have the same indicator value in accordance with one embodiment of the invention. However, generally, the statistical summarizer 102 combines adjacent buckets together to generate the summary 100. Referring back to FIG. 1, the statistical summarizer 102 can combine two adjacent buckets B_(i) and B_(i+1) together to form a single bucket B_((i, i+1)) in an intermediate operation or phase 120. It will be appreciated that the statistical summarizer 102 can select adjacent buckets to be combined in a manner that would allow the summary 100 to be provided in accordance with one or more desired conditions.

For example, in order to provide a histogram of frequencies for column values, adjacent buckets can be selected based on the proximity of their associated frequencies. In other words, the two adjacent buckets that have the least difference between their indicators can be selected to be combined into one (or a single) bucket. The selection process can be repeated using an average value for the combined bucket. As a result, two other adjacent buckets can be selected to be combined and represented as one bucket. Buckets can be combined until a desired number of buckets has been achieved. Referring to FIG. 1, after one or more intermediate phases 120, a final phase 122 can yield the desired number of buckets, namely M buckets (B₁-B_(M)).
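The proximity-based selection described above can be sketched as follows. This is a minimal illustration only, assuming that each bucket is kept as a `[first_index, last_index, total_frequency]` triple and that "least difference" means the smallest absolute difference between the average frequencies of two neighbors; the function name and the sample frequencies are hypothetical and do not come from the figures.

```python
def merge_by_closest_average(freqs, target):
    """Repeatedly combine the two adjacent buckets whose average
    frequencies are closest, until `target` buckets remain."""
    # One bucket per value: [first_index, last_index, total_frequency].
    buckets = [[i, i, f] for i, f in enumerate(freqs)]

    def average(bucket):
        first, last, total = bucket
        return total / (last - first + 1)

    while len(buckets) > max(target, 1):
        # Index of the left bucket of the closest adjacent pair.
        i = min(
            range(len(buckets) - 1),
            key=lambda k: abs(average(buckets[k]) - average(buckets[k + 1])),
        )
        left, right = buckets[i], buckets[i + 1]
        # The combined bucket spans both ranges; its average is total / width.
        buckets[i:i + 2] = [[left[0], right[1], left[2] + right[2]]]
    return buckets


# Hypothetical frequencies for six ordered values, summarized into three buckets.
summary = merge_by_closest_average([7, 9, 8, 20, 21, 3], 3)
print(summary)  # [[0, 2, 24], [3, 4, 41], [5, 5, 3]]
```

Keeping the total and the range in each bucket, rather than a stored average, lets the average of a combined bucket be recomputed exactly after every merge.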

To further elaborate, FIG. 2A depicts a simplified distribution of distinct values 200 representative of statistical data that can be summarized by the statistical summarizer 102 (shown in FIG. 1) for a database in accordance with one embodiment of the invention. In other words, the distribution of distinct values 200 can represent an example of data or statistical data 110 (shown in FIG. 1) that can be summarized by the statistical summarizer 102 (also shown in FIG. 1). In this example, the statistical data 110 is to be ultimately summarized into five (5) buckets in a manner that would minimize the differences (or error) between the values grouped into each of the five (5) buckets.

Referring to FIG. 2A, the distinct values are represented as integers from one (1) to sixteen (16), wherein each integer value is associated with an indicator value represented as a bar. Hence, the first value, namely, one (1), is associated with an indicator value of seven (7) and the second value, namely, two (2), is associated with an indicator value of nine (9) and so on. It should be noted that values are arranged in an ascending order from left to right. It should also be noted that the statistical summarizer 102 (shown in FIG. 1) can be configured to arrange the values as such or be provided the values already arranged in the order depicted in FIG. 2A. Although not shown in FIG. 2A, the values can be considered to be in single buckets, where each value would be in its own bucket (i.e., sixteen (16) buckets where each bucket holds only one value and its indicator).

Given the criterion of minimizing error in this example, the statistical summarizer 102 (shown in FIG. 1) can proceed to group adjacent values (or buckets) that have equal indicator values. As a result, values of eight (8) and nine (9) can be grouped together into a single bucket as represented in FIG. 2B by the lines drawn around them. Referring to FIG. 2B, values of thirteen (13) and fourteen (14) can also be grouped together into a single bucket as represented by lines drawn around them. As a result, the original sixteen (16) buckets can be reduced to fourteen (14) buckets.
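Grouping adjacent values with equal indicator values can be sketched as follows. The sketch assumes values already arranged in ascending order; the function name and the eight sample frequencies are hypothetical and are not the data of FIG. 2A.

```python
from itertools import groupby


def merge_equal_neighbors(values, freqs):
    """Combine runs of adjacent values that share the same frequency.

    Returns (first_value, last_value, frequency) triples, one per bucket.
    """
    buckets = []
    # groupby only groups consecutive items, which matches "adjacent" here.
    for freq, run in groupby(zip(values, freqs), key=lambda pair: pair[1]):
        run = list(run)
        buckets.append((run[0][0], run[-1][0], freq))
    return buckets


# Hypothetical data: values 3-4 and 6-7 have equal frequencies.
values = [1, 2, 3, 4, 5, 6, 7, 8]
freqs = [7, 9, 4, 4, 6, 2, 2, 5]
print(merge_equal_neighbors(values, freqs))
# [(1, 1, 7), (2, 2, 9), (3, 4, 4), (5, 5, 6), (6, 7, 2), (8, 8, 5)]
```

Combining equal neighbors introduces no error, because the average of the combined bucket equals the frequency of every value in it, which is why this pass can run before any error-based selection.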

Thereafter, the statistical summarizer 102 (shown in FIG. 1) can proceed to select two or more other adjacent buckets to be combined such that the difference between their indicators would be minimized in comparison to the other adjacent buckets that could be combined. In other words, the statistical summarizer 102 (shown in FIG. 1) can determine which two buckets, when combined, would yield the minimum error as a measure of the difference between their indicators. The statistical summarizer 102 can continue to select two or more other buckets and combine them based on the minimum error criterion as a measure of the difference between their indicators until ultimately the desired number of buckets, namely, five (5), is achieved. The intermediate selections are further explained below, by way of example, for a square error measurement as a criterion for the selection of the adjacent buckets to be combined.

The resulting five (5) buckets are depicted in FIG. 2C as buckets B1, B2, B3, B4 and B5 where they can serve as a summary for the distribution of the distinct values 200. For example, the first bucket can represent a range of values between one (1) and five (5), where the sum or the average of all indicators can be used as statistical data for values in that range, and so on.

To further elaborate, FIG. 3A also depicts the resulting five (5) buckets (also shown in FIG. 2C) as buckets B1, B2, B3, B4 and B5. In other words, FIG. 3A depicts the resulting summary 302 that can be obtained in accordance with one embodiment of the invention. Referring to FIG. 3A, to provide an example, the averages of the indicators are shown as “AvgFreq,” representative of frequencies of occurrence of values in a database, and the error values are shown as a Square Error measurement (“SqErr”) in accordance with one embodiment of the invention, namely, a Constrained V-optimal Histogram (CVOH). As shown in FIG. 3A, the Constrained V-optimal Histogram (CVOH) technique can yield a total square error value of “11.12.”

In contrast to FIG. 3A, FIGS. 3B and 3C respectively depict the results that can be achieved by Equal-depth Histogram and High Biased Histogram techniques when summarizing the same data, namely, the simplified distribution of distinct values 200 (shown in FIG. 2A). Referring to FIGS. 3B and 3C, respectively, total square errors of “70.4” and “53.7” can be achieved by the Equal-depth Histogram and High Biased Histogram techniques, whereas the Constrained V-optimal Histogram (CVOH) technique can yield a total square error of “11.12,” which is significantly lower and more desirable. In essence, more similar values can be grouped together to provide a more accurate, and thus more useful, summary of the distribution of distinct values 200 as a simplified example of statistical data of a database.

To further elaborate, FIG. 4 depicts a method 400 of storing database values and their associated indicator values in a summarized form in accordance with one embodiment of the invention. Method 400 can, for example, be used by the statistical summarizer 102 shown in FIG. 1. Referring to FIG. 4, initially, database values of a database (e.g., column values of a database table) are arranged (402) in multiple buckets in accordance with an order in an arrangement (e.g., in an ascending order). It should be noted that each one of the database values is associated with an indicator value (e.g., frequency of occurrence of the database column value) and each one of the multiple buckets includes only one of the database values with its associated indicator. In other words, initially, each bucket has only one database value and its associated indicator. Next, it is determined (404) whether to reduce the number of buckets. By way of example, it can be determined (404) whether a particular value, namely, a desired total number of buckets, has been reached. Accordingly, method 400 can continue to combine (406) two adjacent buckets in the arrangement into a combined bucket based on one or more criteria associated with the indicator values until it is determined (404) not to further reduce the number of buckets, for example, until the desired total number of buckets has been reached. The method 400 ends when it is determined (404) not to reduce the number of buckets, for example, because it can be determined (404) that a desired total number of buckets has been reached.

To elaborate even further, FIG. 5 depicts a method 500 of storing frequencies of column values in a summarized form for a table of a database in accordance with one embodiment of the invention. Method 500 can, for example, be used by the statistical summarizer 102 shown in FIG. 1. Referring to FIG. 5, initially, column values with their associated frequencies are stored (502) as buckets in an order in accordance with their column values in an arrangement. Next, it is determined (504) if any adjacent buckets in the arrangement have an equal frequency associated with their column values. Accordingly, any adjacent buckets in the arrangement that have an equal frequency associated with their column values can be combined (506) into a single bucket. Thereafter, it is determined (508) whether to reduce the total number of buckets to reach a maximum allowed number of buckets. If so, two adjacent buckets can be selected (510) for combining together as a single bucket based on an error condition associated with the difference between the frequencies of their column values. The selected buckets can be combined (512) into a single bucket. In effect, the method 500 can continue to select (510) and combine (512) two adjacent buckets based on the error condition associated with the difference between the frequencies of their column values until it is determined (508) not to reduce the total number of buckets because the maximum allowed number of buckets has been reached. The method 500 can end when it is determined (508) that the maximum allowed number of buckets has been reached.
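The flow of method 500 can be sketched end to end as follows. This is a sketch under stated assumptions, not a definitive implementation: the "error condition" is assumed here to be the squared error of the would-be combined bucket around its average frequency, each bucket is kept simply as the list of frequencies it covers, and the names `squared_error` and `summarize` are hypothetical.

```python
def squared_error(freqs):
    """Sum of squared deviations of the frequencies from their average."""
    mean = sum(freqs) / len(freqs)
    return sum((mean - f) ** 2 for f in freqs)


def summarize(freqs, max_buckets):
    """Sketch of steps 502-512: `freqs` lists the frequency of each
    column value, already ordered by column value."""
    # 502/504/506: one bucket per value, with equal neighbors combined.
    buckets = []
    for f in freqs:
        if buckets and buckets[-1][-1] == f:
            buckets[-1].append(f)
        else:
            buckets.append([f])

    # 508: keep reducing while more than the maximum allowed number remain.
    while len(buckets) > max(max_buckets, 1):
        # 510: select the adjacent pair whose combination has the least error.
        i = min(
            range(len(buckets) - 1),
            key=lambda k: squared_error(buckets[k] + buckets[k + 1]),
        )
        # 512: combine the selected pair into a single bucket.
        buckets[i:i + 2] = [buckets[i] + buckets[i + 1]]
    return buckets


# Hypothetical frequencies, summarized into at most three buckets.
print(summarize([5, 5, 6, 20, 22, 1], 3))  # [[5, 5, 6], [20, 22], [1]]
```

Because each step only compares neighboring pairs, the sketch performs a greedy local choice at every iteration rather than a search over all possible bucket boundaries.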

As noted above, selections of adjacent buckets to be combined can be further discussed in the context of a square error measurement. Generally, the distinct values (e.g., column values of a table) can be represented as a finite data sequence X:

$X = x_{1} < x_{2} < x_{3} < \ldots < x_{n}$, and

the indicator values (e.g., frequency counts) of these values can be expressed as:

$f_{x_{1}}, f_{x_{2}}, f_{x_{3}}, \ldots, f_{x_{n}}$.

Let M be the maximal number of buckets in a histogram. M can, for example, be determined by a database system considering its resource consumption and/or computation cost, etc. A bucket (e.g., a histogram bucket) can represent a subsequence of X values,

$x_{s}, x_{s+1}, x_{s+2}, \ldots, x_{e}$,

where $x_{s}$ is the start point of the bucket and $x_{e}$ is the end point of the bucket. Then the range can be represented by a single point $h_{r}$ of the bucket. Here $h_{r}$ can, for example, be the average frequency of all the X values in that range, and it is used as an estimate, for example, for the frequency of each value in $x_{s}, x_{s+1}, x_{s+2}, \ldots, x_{e}$. Hence, an estimated error for a value can be the difference between its actual frequency and $h_{r}$. For example, the error for $x_{s+1}$ is $|h_{r} - f_{x_{s+1}}|$. In practice, the squared error $(h_{r} - f_{x_{s+1}})^{2}$ is preferred. So the squared error for the values in a bucket $b_{r}$ is:

${{SqError}\left( b_{r} \right)} = {\sum\limits_{k = s}^{e}\; \left( {h_{r} - f_{x_{k}}} \right)^{2}}$

The V-Optimal histogram problem is to find a grouping scheme for the M buckets that minimizes the total squared error of the whole histogram:

${Minimize}\mspace{14mu}\left\lbrack {{{SqError}(H)} = {\sum\limits_{r = 1}^{M}\; {\sum\limits_{k = s_{r}}^{e_{r}}\; \left( {h_{r} - f_{k}} \right)^{2}}}} \right\rbrack$

Generally, the smaller the total squared error is, the better the histogram is. The exhaustive search for the globally optimal histogram can be an NP-hard problem because any M−1 out of N distinct values can be selected as the boundaries for the M buckets and all of these possible choices need to be examined. It should be noted that other criteria can be considered by assigning or reassigning error values to affect the likelihood of combining values one way or another. For example, based on a workload or given knowledge of the existing distribution of a column value, a user can pre-assign a preliminary error value to be added to one or more specific column values in a database. As a result, those column values would be less likely to be combined with other values, in an effort to keep them in their own buckets.
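The effect of a pre-assigned preliminary error value can be sketched as follows. How the preliminary error enters the selection is not specified above, so this sketch assumes it is simply added to the squared error of any combination that would include the protected column value; the names `merge_cost` and `cheapest_pair`, the `penalty` mapping, and the sample data are all hypothetical.

```python
def merge_cost(left, right, penalty=None):
    """Squared error of the combined bucket plus any pre-assigned penalties.

    Buckets are lists of (column_value, frequency) pairs; `penalty` maps a
    column value to its pre-assigned preliminary error (assumed additive).
    """
    penalty = penalty or {}
    merged = left + right
    freqs = [f for _, f in merged]
    mean = sum(freqs) / len(freqs)
    error = sum((mean - f) ** 2 for f in freqs)
    return error + sum(penalty.get(value, 0.0) for value, _ in merged)


def cheapest_pair(buckets, penalty=None):
    """Index of the left bucket of the adjacent pair that is cheapest to combine."""
    return min(
        range(len(buckets) - 1),
        key=lambda i: merge_cost(buckets[i], buckets[i + 1], penalty),
    )


buckets = [[("a", 10)], [("b", 10)], [("c", 12)], [("d", 30)]]
print(cheapest_pair(buckets))                # 0: "a" and "b" combine at no error
print(cheapest_pair(buckets, {"a": 100.0}))  # 1: "a" is now kept in its own bucket
```

With the penalty in place, the selection moves to the next cheapest pair, so the protected value tends to remain in its own bucket for as long as cheaper combinations exist.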

It will be appreciated that the techniques described above are especially suitable for large database systems that can typically store relatively large amounts of data. Such databases can include large parallel or multiprocessing database systems that may be comprised of multiple database nodes (or nodes), where each node can have its own processor(s) and storage device(s).

To further elaborate, FIG. 6 depicts a database node 1105 of a database system or a Database Management System (DBMS) 1000 in accordance with one embodiment of the invention. The DBMS 1000 can, for example, be provided as a Teradata Active Data Warehousing System. It should be noted that FIG. 6 depicts in greater detail an exemplary architecture for one database node 1105 ₁ of the DBMS 1000 in accordance with one embodiment of the invention.

Referring to FIG. 6, the DBMS node 1105 ₁ includes multiple processing units (or processing modules) 1110 _(1-N) connected by a network 1115, that manage the storage and retrieval of data in data-storage facilities 1120 _(1-N). Each of the processing units 1110 _(1-N) can represent one or more physical processors or virtual processors, with one or more virtual processors (e.g., an Access Module Processor (AMP)) running on one or more physical processors in a Teradata Active Data Warehousing System. For example, when provided as AMPs, each AMP can receive work phases from a parsing engine (PE) 1130 which is also described below.

In the case in which one or more virtual processors are running on a single physical processor, the single physical processor swaps between the set of N virtual processors. For the case in which N virtual processors are running on an M-processor node, the node's operating system can schedule the N virtual processors to run on its set of M physical processors. By way of example, if there are four (4) virtual processors and four (4) physical processors, then typically each virtual processor could run on its own physical processor. As such, assuming there are eight (8) virtual processors and four (4) physical processors, the operating system could schedule the eight (8) virtual processors against the four (4) physical processors, in which case swapping of the virtual processors could occur.

In the database system 1000, each of the processing units 1110 _(1-N) can manage a portion of a database stored in a corresponding one of the data-storage facilities 1120 _(1-N). Also, each of the data-storage facilities 1120 _(1-N) can include one or more storage devices (e.g., disk drives). Again, it should be noted that the DBMS 1000 may include additional database nodes 1105 _(2-O) in addition to the database node 1105 ₁. The additional database nodes 1105 _(2-O) can be connected by extending the network 1115. Data can be stored in one or more tables in the data-storage facilities 1120 _(1-N). The rows 1125 _(1-Z) of the tables can, for example, be stored across multiple data-storage facilities 1120 _(1-N) to ensure that workload is distributed evenly across the processing units 1110 _(1-N). In addition, a parsing engine 1130 can organize the storage of data and the distribution of table rows 1125 _(1-Z) among the processing units 1110 _(1-N). The parsing engine 1130 can also coordinate the retrieval of data from the data-storage facilities 1120 _(1-N) in response to queries received, for example, from a user. The DBMS 1000 usually receives queries and commands to build tables in a standard format, such as, for example, SQL. Parsing engine 1130 can also handle logons, as well as parsing the SQL requests from users, turning them into a series of work phases that can be sent to be executed by the processing units 1110 _(1-N).

For example, a client-side Host (e.g., a Personal Computer (PC), a server) can be used to log on to the database system 1000 provided as a Teradata database server. Communication between the client-side Host and the database system 1000 can be facilitated by a database communicating mechanism, for example, by an ANSI CLI (Call Level Interface) standard that can include parcel requests and responses that facilitate the movement of data resident on the client-side host over to the database system 1000.

For example, the rows 1125 _(1-Z) can be distributed across the data-storage facilities 1120 _(1-N) by the parsing engine 1130 in accordance with their primary index. The primary index defines the columns of the rows that are used for calculating a hash value. The function that produces the hash value from the values in the columns specified by the primary index may be called the hash function. Some portion, possibly the entirety, of the hash value can be designated a “hash bucket.” As such, the hash buckets can be assigned to data-storage facilities 1120 _(1-N) and associated processing units 1110 _(1-N) by a hash bucket map. The characteristics of the columns chosen for the primary index determine how evenly the rows are distributed.
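By way of illustration only, the mapping from primary-index values to a processing unit described above can be sketched in Python as follows. The bucket count, the number of processing units, the use of SHA-256 as the hash function, and the round-robin hash bucket map are all assumptions made for this sketch; they are not taken from the described system.

```python
import hashlib

NUM_HASH_BUCKETS = 16  # hypothetical; chosen small for readability
NUM_AMPS = 4           # hypothetical number of processing units

def row_hash(primary_index_values):
    """Hash the primary-index column values of a row (SHA-256 is a stand-in)."""
    key = "|".join(str(v) for v in primary_index_values).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

def hash_bucket(hash_value):
    """Designate a portion of the hash value as the hash bucket."""
    return hash_value % NUM_HASH_BUCKETS

# Hash bucket map: hash bucket number -> processing unit (round-robin here).
HASH_BUCKET_MAP = {b: b % NUM_AMPS for b in range(NUM_HASH_BUCKETS)}

def amp_for_row(primary_index_values):
    """Return the processing unit that would store a row with these index values."""
    return HASH_BUCKET_MAP[hash_bucket(row_hash(primary_index_values))]
```

Rows whose primary-index values are equal always land on the same processing unit, which is why a primary index with few distinct values can distribute rows unevenly.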

Referring again to FIG. 6, it should be noted that a statistical summarizer 1002 can be provided as a central component for the processing units 1110 _(1-N). However, it should be noted that each one of the processing units 1110 _(1-N) can be effectively provided with a local statistical summarizer that can serve as a local component and possibly collaborate with the central statistical summarizer 1002. Of course, various other configurations are possible and will become readily apparent in view of the foregoing.

In accordance with one embodiment, a V-Optimal Histogram can be provided. It will be appreciated that the V-Optimal Histogram can, for example, be provided for relatively large tables (e.g., one terabyte tables) with a relatively large number of distinct values in a parallel processing environment, such as the database system 1000 (depicted in FIG. 6). For example, the process for providing the V-Optimal Histogram can be done in two parts. In the first part, each AMP scans its local table rows to collect distinct values and their frequencies. The parallel scanning of the table rows on multiple AMPs can be done in a conventional manner. Then, in the second part, all the distinct values and their frequencies are sent to a master AMP for global aggregation. Thereafter, the master AMP builds the histogram from the distinct values. The technique for doing so can, for example, be performed as follows:

Input: the maximal bucket number M, and the table and column that the histogram is built for.

Output: The Constrained V-Optimal Histogram

Technique for V-Optimal Histogram:

Phase 1: Each AMP collects the distinct values and their corresponding frequencies locally.

Phase 2: The local information is sent from every AMP to a selected master AMP for global aggregation. The distinct values from all AMPs (associated with their corresponding frequencies) are then sorted in a list in increasing order of value. Assume there are N distinct values in total.
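As a minimal sketch of phases 1 and 2, assuming each AMP's local rows are available as a simple list of column values, the local collection and global aggregation could look as follows in Python. The function names collect_local and aggregate_global are hypothetical and used only for illustration.

```python
from collections import Counter

def collect_local(rows):
    """Phase 1: one AMP counts the distinct values found in its local rows."""
    return Counter(rows)

def aggregate_global(local_counts):
    """Phase 2: the master AMP sums the per-AMP counts and sorts by value."""
    total = Counter()
    for counts in local_counts:
        total.update(counts)  # adds frequencies of values seen on several AMPs
    return sorted(total.items())  # list of (value, frequency), increasing by value

# Example: two AMPs, each holding a few values of the column.
pairs = aggregate_global([collect_local([1, 2, 2]), collect_local([2, 3])])
```

In this example, the value 2 occurs on both AMPs, so its global frequency is the sum of the two local frequencies.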

Phase 3: The master AMP builds the initial histogram buckets, where each bucket contains one or more immediately neighboring values that share the same frequency. Phases 4 and 5 then merge neighbor buckets further until the total number of buckets in the final result is equal to or less than M.

Scan the sorted list starting from the first value X₁. At X₁, the first bucket b₁ is built with: b₁·start_point=b₁·end_point=X₁, b₁·number_of_values=1, b₁·average_frequency=fx₁, and b₁·squared_error=0.

Then, look ahead at the right neighbors of X₁, one by one. As long as the frequency of the neighbor is the same as fx₁, continue to look ahead until encountering an X_(i+1) with fx_(i+1) not equal to fx₁. When the scan stops at X_(i+1), pack all the values from X₁ to X_(i) into b₁. Now update b₁ to: b₁·end_point=X_(i), and b₁·number_of_values=i.

Note that b₁·average_frequency and b₁·squared_error are kept unchanged. Then, start to look at the value X_(i+1) and build the second bucket b₂ with: b₂·start_point=b₂·end_point=X_(i+1), b₂·number_of_values=1, b₂·average_frequency=fx_(i+1), and b₂·squared_error=0.

Then, proceed as was done at X₁. All the values following X_(i+1) that have the same frequency fx_(i+1) will be packed into b₂. Suppose we stop at X_(j+1); then update b₂ with: b₂·end_point=X_(j), and b₂·number_of_values=j−i.

Similarly, b₂·average_frequency and b₂·squared_error are kept unchanged. Then, continue in this way until all values have been scanned. Suppose we have built M′ buckets and these buckets have been placed in a list, Result_List, in increasing order of their start_point. If M′<=M, then the work is done and the resulting histogram consists of the M′ buckets. But if M′>M, then we continue to phase 4 below.
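The scan of phase 3 can be sketched as follows, assuming the input is the sorted list of (value, frequency) pairs produced in phase 2. The Bucket class and the function build_initial_buckets are hypothetical names chosen to mirror the bucket fields named above; this is an illustrative sketch rather than a definitive implementation.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    start_point: float
    end_point: float
    number_of_values: int
    average_frequency: float
    squared_error: float = 0.0  # stays 0 while all packed values share one frequency

def build_initial_buckets(sorted_pairs):
    """Pack runs of neighboring values with equal frequency into one bucket each."""
    result_list = []
    for value, freq in sorted_pairs:
        last = result_list[-1] if result_list else None
        if last is not None and last.average_frequency == freq:
            # Same frequency as the current run: extend the bucket.
            last.end_point = value
            last.number_of_values += 1
        else:
            # Frequency changed (or first value): start a new bucket.
            result_list.append(Bucket(value, value, 1, freq))
    return result_list

buckets = build_initial_buckets([(1, 5), (2, 5), (3, 7), (4, 7), (5, 7), (6, 5)])
```

Note that the value 6 starts a third bucket even though its frequency equals that of the first bucket, because only immediate neighbors are packed together.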

Phase 4: The master AMP probes the merge of each pair of neighbor buckets in the current Result_List. The possible merges of these pairs are ranked so that the best candidate can be merged first, then the second best, the third best, and so on, in the next phase (phase 5). In the ranking, workload-related constraints can be considered. The workload-related constraints can, for example, be specified by a customer of the database, so that the order of the merges can be controlled. As a result, this may give higher resolution to the buckets that contain, for example, “hot” values:

Scan the M′ buckets from the beginning of the Result_List. For any two immediate neighbor buckets b_(i) and b_(i+1), a new bucket b_((i, i+1)) is built to combine the two by including all their values. The average frequency and squared error of b_((i, i+1)) will be calculated from all the values. At the same time, we will record b_((i, i+1))·delta_error as:

b_((i, i+1))·delta_error = b_((i, i+1))·squared_error − (b_(i)·squared_error + b_(i+1)·squared_error).

Each bucket b_((i, i+1)) is also assigned a ranking score. In the simplest case, the ranking score can be defined as delta_error. Basically, the merge of two buckets that produces a bigger bucket with the minimum increase in squared error is preferred first. As will be discussed below, this ranking score definition can be enhanced to integrate the user-specified constraints. Phase 4 ends with a new list, Working_List, of M′−1 new buckets, which are sorted in increasing order of their ranking scores.
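The ranking of phase 4 can be sketched as follows, under the simplifying assumption that each bucket is reduced to a tuple (number_of_values, average_frequency, squared_error) and that start and end points are omitted. Rather than recomputing the squared error from all the values, the sketch uses the standard identity for pooling two groups around a common mean, which yields the same result. The function names are hypothetical.

```python
def merge(a, b):
    """Combine two neighbor buckets given as (count, mean_frequency, squared_error)."""
    n = a[0] + b[0]
    mean = (a[0] * a[1] + b[0] * b[1]) / n
    # Squared error around the new mean: each side's own error plus the
    # shift of its mean, weighted by how many values it holds.
    sse = a[2] + b[2] + a[0] * (a[1] - mean) ** 2 + b[0] * (b[1] - mean) ** 2
    return (n, mean, sse)

def delta_error(a, b):
    """Increase in squared error caused by merging buckets a and b."""
    return merge(a, b)[2] - (a[2] + b[2])

def build_working_list(result_list):
    """Return (ranking_score, i) for each neighbor pair (i, i+1), best merge first."""
    candidates = [
        (delta_error(result_list[i], result_list[i + 1]), i)
        for i in range(len(result_list) - 1)
    ]
    return sorted(candidates)

# Three initial buckets: frequencies 5,5 | 7,7,7 | 5.
working_list = build_working_list([(2, 5.0, 0.0), (3, 7.0, 0.0), (1, 5.0, 0.0)])
```

Here the pair (1, 2) ranks ahead of the pair (0, 1) because merging the single value of frequency 5 into the run of 7s adds less squared error (about 3.0) than merging the two runs of two and three values (about 4.8).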

Phase 5: The master AMP starts to merge the buckets in the Result_List as instructed by the bucket at the beginning of the Working_List, because the first bucket in the Working_List is considered the best candidate for a merge at that moment. Once the merge is done, the Result_List and Working_List are updated to reflect the impact of the merge. Then, the next top bucket in the Working_List is processed. This is repeated until the total number of buckets in the Result_List is reduced to M.

Looking at the Working_List, since its buckets are sorted in increasing order of their ranking scores, the bucket at the beginning of the Working_List actually points to the best two candidate buckets in the Result_List for a merge. The second bucket in the Working_List points to the second best candidate bucket pair in the Result_List for a merge, and so on. Thus, we will start the merge process with the first bucket in the Working_List.

5.1 Suppose the bucket b_((i, i+1)) is currently the first bucket in the Working_List; take it off the Working_List. It indicates that the two best candidate buckets for a merge are b_(i) and b_(i+1) in the Result_List. After we take off b_((i, i+1)), the second bucket in the Working_List moves to the top. Update the Result_List by replacing the two candidate buckets b_(i) and b_(i+1) with the new, bigger bucket b_((i, i+1)).

5.2 Note that when we built the Working_List in Phase 4, the bucket b_(i) in the Result_List might have been used twice, to build the new buckets b_((i−1, i)) and b_((i, i+1)). Similarly, b_(i+1) might have been used twice, to build the new buckets b_((i, i+1)) and b_((i+1, i+2)). We also need to take b_((i−1, i)) and b_((i+1, i+2)) off the Working_List because our merge has affected the information of the two. Now we look at the buckets b_(i−1), b_((i, i+1)), and b_(i+2) in the Result_List. The buckets b_(i−1) and b_((i, i+1)) will be combined to build a new bucket to replace the old b_((i−1, i)), and the buckets b_((i, i+1)) and b_(i+2) will be combined to build another new bucket to replace the old b_((i+1, i+2)). Each of the two new buckets needs to be re-inserted into the sorted Working_List. When we insert the two new buckets into the Working_List, we also assign an appropriate ranking score to each of them. The ranking scores account for the customer's workload-related constraints to control where the buckets are inserted. Note that the position of a bucket in the Working_List determines when it is considered a good candidate for a merge.

5.3 Repeat sub-phases 5.1 and 5.2 until the total number of buckets in the Result_List reaches M.
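The merge loop of phase 5 can be sketched as follows. For brevity, this sketch deliberately swaps in a simpler technique: instead of maintaining a sorted Working_List and re-inserting the two affected neighbor pairs after each merge, it re-ranks all neighbor pairs on every pass. Apart from tie-breaking, this selects the same merges, but at quadratic rather than O(N log N) cost. Buckets are again assumed to be tuples (number_of_values, average_frequency, squared_error), and all function names are hypothetical.

```python
def merge(a, b):
    """Combine two neighbor buckets given as (count, mean_frequency, squared_error)."""
    n = a[0] + b[0]
    mean = (a[0] * a[1] + b[0] * b[1]) / n
    sse = a[2] + b[2] + a[0] * (a[1] - mean) ** 2 + b[0] * (b[1] - mean) ** 2
    return (n, mean, sse)

def delta_error(a, b):
    """Increase in squared error caused by merging buckets a and b."""
    return merge(a, b)[2] - (a[2] + b[2])

def reduce_buckets(result_list, max_buckets):
    """Greedily merge the cheapest neighbor pair until at most max_buckets remain."""
    result = list(result_list)
    while len(result) > max_buckets and len(result) > 1:
        # The pair with the smallest score plays the role of the
        # bucket at the head of the Working_List.
        i = min(range(len(result) - 1),
                key=lambda k: delta_error(result[k], result[k + 1]))
        # Replace b_i and b_(i+1) with the merged bucket b_(i, i+1).
        result[i:i + 2] = [merge(result[i], result[i + 1])]
    return result

# Frequencies 5,5 | 7,7,7 | 5 | 20, reduced to M = 2 buckets.
histogram = reduce_buckets(
    [(2, 5.0, 0.0), (3, 7.0, 0.0), (1, 5.0, 0.0), (1, 20.0, 0.0)], 2)
```

In this example the outlier frequency 20 keeps its own bucket, while the six values with frequencies near 6 collapse into a single bucket.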

Phase 6: The master AMP returns the Result_List as the Constrained V-Optimal Histogram and saves it in the database dictionary.

The initial sorting of the Result_List (phase 2) and Working_List (phase 4) is O(N log N). Then at most M′−M (<N) merges happen in phase 5. Each merge may require updating the Working_List and Result_List. With the help of advanced structures such as a B-tree or max-heap, each of these updates is expected to be done at an average computation cost of O(log N). Thus, the total computational complexity of building the Constrained V-Optimal Histogram (CVOH) is O(N log N).

Collect Statistics—CVOH with Workload-Related Constraints

In some cases, the database environment and its data may be well known. For example, in many real-life cases, users (especially administrators) of databases may know the workload of their applications very well. Today, there are also utilities available to help database users determine the characteristics of a specific workload. As one example, a database user can attempt to collect statistics on a column col_1 of table tab_1. In this case, the user may know that tab_1 is often joined to a very big table tab_2 in an application, and that the join condition is “tab_1.col_1 = tab_2.col_2.” The database user may also know that most rows in tab_2 have column col_2 values in the range between 10 and 20. This means any significant estimation error for col_1 values in the range between 10 and 20 could seriously hinder the cardinality estimation of the join. In this case, the database user can build a CVOH on tab_1.col_1 so that higher resolution is especially given to the values between 10 and 20 in the histogram in accordance with one embodiment. This can help the optimizer to improve its estimation accuracy. In other words, the database user is able to use knowledge about the database to build a histogram that is optimized for a particular database query or database workload. In one embodiment, the database user can, for example, submit an enhanced “COLLECT STATISTICS” statement like the one below:

COLLECT STATISTICS ON tab_1 COLUMN col_1

HISTOGRAM CVOH

CONSTRAINTS (MAX ESTIMATE ERROR PERCENTAGE 20% WHEN col_1 BETWEEN 10 AND 20)

This statement can instruct a database system that, when the CVOH for col_1 is constructed, if a bucket already contains values in the range between 10 and 20, then the merge between it and any other bucket needs to be evaluated against the constraints. If the evaluation result conflicts with the constraints, the ranking score assigned to the bucket representing that merge will be adjusted, for example, from a default “delta_error” value to a very high value. As a result, all such buckets can be placed somewhere close to the end of a Working_List by the sorting and insertion operations. Then, the construction algorithm can try to explore other merge possibilities first. Only when there is no other choice and the number of buckets is still bigger than M can the merge of this bucket with others be considered.

It should be noted that if the constraints correspond to a group of individual values to be held off from the merge, then the CVOH will be similar to HBH; they both use a group of buckets to save the individual high-biased values and their frequency. The minor difference is that CVOH uses V-Optimal Histogram for the non-high-biased values but HBH uses Equal-depth. Thus, HBH can be considered as a special case of CVOH.

Integrate Workload-Related Constraints into CVOH

Referring to phase 4:

When we build any bucket b_((i, i+1)) for the Working_List, consider the constraint. For every value x_(j) contained in the bucket b_((i, i+1)), if x_(j) is in the range between 10 and 20, then find its frequency fx_(j) and check whether the condition below is true:

|fx_(j) − b_((i, i+1))·average_frequency| / fx_(j) > 20%

If yes, the constraint would be violated by the merge represented by the bucket b_((i, i+1)). Thus, the ranking score of bucket b_((i, i+1)) will be raised to a very high value, such as:

(delta_error+HIGH_RANK_SCORE_THRESHOLD)

where HIGH_RANK_SCORE_THRESHOLD can be a very large constant. The sorting logic of the Working_List will then place the bucket somewhere close to the end of the Working_List. As a result, it will be merged last. In other words, an additional error value can be added to the error value associated with the one or more database values that are not desired to be combined with any or at least one or more other database values, thereby reducing the likelihood of combining the one or more database values with the one or more other database values. The one or more database values can, for example, be hot values. As another example, a workload constraint can be integrated with a combining strategy for combining buckets, where a preliminary constant error value can be added to the delta-error to avoid combining the one or more database values with the one or more other database values.
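The constraint check and score adjustment can be sketched as follows, assuming the candidate bucket exposes the (value, frequency) pairs it would contain together with its average frequency. The numeric value chosen for HIGH_RANK_SCORE_THRESHOLD, the default range of 10 to 20, the 20% limit as a default argument, and the function names are assumptions made for illustration.

```python
HIGH_RANK_SCORE_THRESHOLD = 1e12  # hypothetical "very large constant"

def violates_constraint(values_with_freq, average_frequency,
                        low=10, high=20, max_error=0.20):
    """True if any constrained value would be estimated with too much relative error."""
    return any(
        low <= x <= high and abs(f - average_frequency) / f > max_error
        for x, f in values_with_freq  # f >= 1 for any value present in the table
    )

def ranking_score(delta_error, values_with_freq, average_frequency):
    """Default score is delta_error; a violating merge is pushed to the end of the list."""
    if violates_constraint(values_with_freq, average_frequency):
        return delta_error + HIGH_RANK_SCORE_THRESHOLD
    return delta_error

# Merging values 12 (frequency 100) and 13 (frequency 60) gives an average of 80,
# which misestimates the value 13 by about 33%, so the merge is penalized.
penalized = ranking_score(800.0, [(12, 100), (13, 60)], 80.0)
# The same frequencies outside the range 10..20 keep the default score.
unpenalized = ranking_score(800.0, [(30, 100), (31, 60)], 80.0)
```

Because the penalty is added to delta_error rather than replacing it, penalized merges remain ordered among themselves by their squared-error increase.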

Referring to Phase 5:

Similarly, whenever we build the two new buckets for a completed merge and insert them back into the Working_List, we also need to check every value contained in these new buckets against the constraints. If a constraint is violated, the new bucket will also be assigned a ranking score equal to (delta_error+HIGH_RANK_SCORE_THRESHOLD), and will thus be placed toward the end of the Working_List.

In view of the foregoing, it will be appreciated that a parallel DBMS can efficiently build a V-Optimal Histogram in O(N log N) in accordance with one embodiment. A V-Optimal Histogram can be better than other state-of-the-art histograms in terms of accuracy. This can improve the accuracy of cardinality or selectivity estimation during the optimization phase. As a result, the overall query performance can be enhanced. In addition, users of databases can build a V-Optimal Histogram for their data according to the specific characteristics of data access patterns (e.g., a specific workload). As a result, the histogram generated by CVOH can further improve cardinality or selectivity estimation.

Referring now to FIG. 7, in one exemplary system, the parsing engine 1130 can be made up of three components: a session control 1200, a parser 1205, and a dispatcher 1210. In the example, the session control 1200 provides the logon and logoff function. It accepts a request for authorization to access the database, verifies it, and then either allows or disallows the access. When the session control 1200 allows a session to begin, a user may submit a SQL request, which is routed to the parser 1205. Regarding the dispatcher 1210, it should be noted that some monitoring functionality for data management and/or workload management may be performed by a regulator to monitor workloads and usage of the resources, for example, by using internal messages sent from the AMPs to the dispatcher 1210. The dispatcher 1210 can provide an internal status of every session and request running on the system, for example, by using internal messages sent from the AMPs to the dispatcher 1210. As such, at least part of database management can be provided by the dispatcher 1210 in accordance with one embodiment of the invention. The dispatcher 1210 can also operate as a workload dispatcher in order to effectively manage workloads. As such, at least part of a data management system can be provided by the dispatcher 1210 in accordance with one embodiment of the invention.

As illustrated in FIG. 8, the parser 1205 interprets the SQL request 1300, checks it for proper SQL syntax 1305, evaluates it semantically 1310, and consults a data dictionary to ensure that all of the objects specified in the SQL request actually exist and that the user has the authority to perform the request 1315. Finally, the parser 1205 runs an optimizer 1320, which can generate the least expensive plan to perform the request.

Generally, various aspects, features, embodiments or implementations of the invention described above can be used alone or in various combinations. Furthermore, implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CDROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, tactile or near-tactile input.

Implementations of the subject matter described in this specification can be implemented in a computing system that includes a backend component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a frontend component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations of the disclosure. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The various aspects, features, embodiments or implementations of the invention described above can be used alone or in various combinations. The many features and advantages of the present invention are apparent from the written description and, thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, the invention should not be limited to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.

1. A method of storing database values and their associated indicator values in a summarized form in a database that stores data, wherein the method is implemented at least partly by a device, and wherein the method comprises: arranging multiple buckets that include multiple database values of a database in accordance with an order in an arrangement based on the database values, wherein each one of the database values is associated with an indicator value, and wherein each one of the multiple buckets includes only one of the database values with its associated indicator value; determining whether to reduce the number of buckets arranged in the arrangement; combining two adjacent buckets in the arrangement into a combined bucket at least partly based the difference between the indicator values if the two adjacent buckets when the determining determines to reduce the number of buckets; and repeating the determining of whether to reduce the number of buckets and the combining of yet another two adjacent buckets in the arrangement into another combined bucket until the determining determines not to further reduce the number of buckets.
2. The method of claim 1, wherein the database values are column values of a database table of a database, and wherein their associated indicator values are their frequency of occurrence in the database table; wherein determining whether to reduce the number of buckets arranged in the arrangement includes determining whether two frequencies of occurrence of two adjacent buckets are within an acceptable range of difference, and wherein the determining of whether reduce the number of buckets includes determining whether a desired total number of buckets has been reached, and wherein the repeating repeats the combining for yet another two adjacent buckets until the determining determines not to further reduce the number of buckets as the desired total number of buckets has been reached.
3. The method of claim 1, wherein the one or more criteria include an error criterion determined based on the difference between two or more of the indicator values.
4. The method of claim 1, wherein the desired total number of buckets is provided as input.
5. The method of claim 1, wherein the one or more criteria include the desirability for not combining one or more of the database values with one or more other database values of the database values.
6. The method of claim 5, wherein the method further comprises: adding an additional error value to a default error value associated with the one or more database values that are not desired to be combined with the one or more other database values of the database values, thereby reducing the likelihood of combining that the one or more database values with the one or more other database values.
7. The method of claim 5, wherein the method further comprises: integrating a workload constraint into a combining strategy for combining the buckets; and adding a preliminary constant error value to a default delta-error associated with the one or more database values in an attempt to avoid combining the one or more database values with the one more other database values.
8. The method of claim 2, wherein the method further comprises: determining the desired total number of buckets by considering one or more of the following: cost of memory, storage, computational resources for maintaining a histogram, and input provided by a database administrator and/or database user.
9. The method of claim 1, wherein the method further comprises: receiving as input through a user interface the desired total number of buckets and the one or more criteria.
10. An apparatus that includes one or more processors operable to store database values and their associated indicator values in a summarized form, by performing at least the following: arranging multiple buckets that include multiple database values of a database in accordance with an order in an arrangement based on the database values, wherein each one of the database values is associated with an indicator value, and wherein each one of the multiple buckets includes only one of the database values with its associated indicator value; determining whether to reduce the number of buckets arranged in the arrangement; combining two adjacent buckets in the arrangement into a combined bucket at least partly based the difference between the indicator values if the two adjacent buckets when the determining determines to reduce the number of buckets; and repeating the determining of whether to reduce the number of buckets and the combining of yet another two adjacent buckets in the arrangement into another combined bucket until the determining determines not to further reduce the number of buckets.
11. The apparatus of claim 10, wherein the determining of whether reduce the number of buckets determines whether a desired total number of buckets has been reached, and wherein the repeating repeats the combining for yet another two adjacent buckets until the determining determines not to further reduce the number of buckets as the desired total number of buckets has been reached.
12. The apparatus of claim 10, wherein the one or more criteria include an error criterion determined based on the difference between two or more of the indicator values.
13. The apparatus of claim 10, wherein the desired total number of buckets is provided as input.
14. The apparatus of claim 10, wherein the one or more criteria includes the desirability for not combining one or more of the database values with one or more other database values of the database values.
15. The apparatus of claim 10, wherein the storing of the database values and their associated indicator values in a summarized form further comprises: integrating a workload constraint into a combining strategy for combining the buckets; and adding a preliminary constant error value to a default delta-error associated with the one or more database values in an attempt to avoid combining the one or more database values with the one more other database values.
16. A non-transitory computer readable storage medium storing at least computer code that when execute stores database values and their associated indicator values in a summarized form by at least: arranging multiple buckets that include multiple database values of a database in accordance with an order in an arrangement based on the database values, wherein each one of the database values is associated with an indicator value, and wherein each one of the multiple buckets includes only one of the database values with its associated indicator value; determining whether to reduce the number of buckets arranged in the arrangement; combining two adjacent buckets in the arrangement into a combined bucket at least partly based the difference between the indicator values if the two adjacent buckets when the determining determines to reduce the number of buckets; and repeating the determining of whether to reduce the number of buckets and the combining of yet another two adjacent buckets in the arrangement into another combined bucket until the determining determines not to further reduce the number of buckets.
17. The non-transitory computer readable storage medium of claim 16, wherein the determining of whether reduce the number of buckets determines whether a desired total number of buckets has been reached, and wherein the repeating repeats the combining for yet another two adjacent buckets until the determining determines not to further reduce the number of buckets as the desired total number of buckets has been reached.
18. The non-transitory computer readable storage medium of claim 16, wherein the one or more criteria include an error criterion determined based on the difference between two or more of the indicator values.
19. The non-transitory computer readable storage medium of claim 16, wherein the desired total number of buckets is provided as input.
20. The non-transitory computer readable storage medium of claim 16, wherein the one or more criteria includes the desirability for not combining one or more of the database values with one or more other database values of the database values.